Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


Posters

Posters at ISMB 2020 will be presented virtually. Authors will pre-record their poster talk (5-7 minutes) and upload it to the virtual conference platform along with a PDF of their poster. All registered conference participants will have access to the posters and presentations during the conference and until October 31, 2020. A chat function provides Q&A opportunities for interaction between presenters and participants.

Preliminary information on preparing your poster and poster talk is available at: https://www.iscb.org/ismb2020-general/presenterinfo#posters

Ideally authors should be available for interactive chat during the times noted below:


Poster Session A: July 13 and July 14, 7:45 am - 9:15 am Eastern Daylight Time
Poster Session B: July 15 and July 16, 7:45 am - 9:15 am Eastern Daylight Time
July 14 between 10:40 am - 2:00 pm EDT
A Comprehensive Analysis of the Phylogenetic Signal in Ramp Sequences in 211 Vertebrates
COSI: EvoCompGen COSI
  • Lauren McKinnon, Brigham Young University, United States
  • Justin Miller, Brigham Young University, United States
  • Michael Whiting, Brigham Young University, United States
  • John Kauwe, Brigham Young University, United States
  • Perry Ridge, Brigham Young University, United States

Short Abstract: Background: Ramp sequences increase translational speed and accuracy when rare, slowly-translated codons are found at the beginnings of genes. Here, the results of the first analysis of ramp sequences in a phylogenetic construct are presented.

Methods: Ramp sequences were compared across 211 vertebrates (110 mammalian and 101 non-mammalian). The presence or absence of ramp sequences was analyzed as a binary character in parsimony and maximum likelihood frameworks. Additionally, ramp sequences were mapped to the Open Tree of Life taxonomy to determine the number of parallelisms and reversals that occurred.

Results: Parsimony and maximum likelihood analyses of the presence/absence of ramp sequences recovered phylogenies that are highly congruent with established phylogenies. Additionally, the retention index of ramp sequences is significantly higher than would be expected due to random chance (p-value = 0). A chi-square analysis of completely orthologous ramp sequences resulted in a p-value of approximately zero as compared to random chance.

Discussion: Ramp sequences recover phylogenies comparable to those from other phylogenomic methods. Although not all ramp sequences appear to have a phylogenetic signal, more ramp sequences track speciation than expected by random chance. Therefore, ramp sequences may be used in conjunction with other phylogenomic approaches.
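
As an illustration of treating presence/absence as a binary character under parsimony, the sketch below counts the minimum number of state changes on a rooted binary tree with the Fitch algorithm and derives a retention index from it. This is a generic textbook sketch, not the authors' pipeline; the nested-tuple tree encoding, the function names, and the toy species labels are assumptions made for the example.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree:   nested 2-tuples of leaf names, e.g. ((("a", "b"), "c"), ("d", "e"))
    states: dict mapping every leaf name to 0 (ramp absent) or 1 (ramp present)
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1          # children disagree: one change is needed here
        return left | right

    visit(tree)
    return changes


def retention_index(tree, states):
    """RI = (g - s) / (g - m) for one binary character; None if uninformative."""
    ones = sum(states.values())
    zeros = len(states) - ones
    g = min(ones, zeros)      # most changes the character could need on any tree
    m = 1                     # fewest changes possible when both states occur
    if g <= m:
        return None
    s = fitch_changes(tree, states)
    return (g - s) / (g - m)


# Hypothetical five-species tree and two toy presence/absence patterns.
tree = ((("a", "b"), "c"), ("d", "e"))
clustered = {"a": 1, "b": 1, "c": 0, "d": 0, "e": 0}
scattered = {"a": 1, "b": 0, "c": 1, "d": 0, "e": 0}
print(fitch_changes(tree, clustered), retention_index(tree, clustered))
print(fitch_changes(tree, scattered), retention_index(tree, scattered))
```

A character whose presence is confined to one clade needs a single change and gets the highest retention index, while a scattered pattern needs more changes and scores lower.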

A Computational Molecular Evolutionary Approach to Characterize Bacterial Proteins
COSI: EvoCompGen COSI
  • Samuel Chen, Michigan State University, United States
  • Janani Ravi, Michigan State University, United States

Short Abstract: Molecular evolution and phylogeny can provide key insights into pathogenic protein families. Studying how these proteins evolve across bacterial lineages can help identify lineage-specific and pathogen-specific signatures and variants, and consequently, their functions. We are building a streamlined computational approach for the molecular evolution and phylogeny of target proteins, widely applicable across protein and pathogen families. We applied this approach to examine the phage shock protein (Psp) system and its evolution across all three domains of life (~6500 genomes within bacteria, archaea, and eukaryota). Our process currently starts with one or more proteins and their homologs from thousands of species, along with their detailed functional characterization including domain architectures, genomic neighborhoods, and phyletic maps. We are creating a custom R package to analyze and visualize the conservation of domain architectures and genomic neighborhoods across lineages. To showcase the versatility of this approach, we build a web-app to allow dynamic analysis and visualization. Our ultimate goal is to build an online molecular evolution and phylogeny platform for biologists and an R package for computational biologists to use with their data by simply uploading a protein sequence file.

A Pangenome and Comparative Pathogenomics Workflow for Bacterial Pathogens
COSI: EvoCompGen COSI
  • Karn Jongnarangsin, Michigan State University, United States
  • Janani Ravi, Michigan State University, United States

Short Abstract: Mycobacterial species have a wide range of associated pathologies in human and animal hosts. While members of the Mycobacterium tuberculosis (MTB) complex are causative agents of tuberculosis, non-tuberculous Mycobacteria (NTM) cause pulmonary and chronic pathologies in animals and humans. Systematic genome-wide comparisons across mycobacteria (MTB complex and NTM) are, therefore, warranted to determine the sequence features that culminate in the differential pathogenicities and host-specificities. One highly viable method is to construct pangenomes for the MTB complex and subgroups of NTMs in order to identify genomic features that exist within the core (conserved) or variable/accessory (unique, lineage-specific) genes for each comparison, to study species evolution and diversity, and to perform functional annotation. The process of constructing a pangenome from a large number of complete genomes involves multiple, computationally intensive, intermediary steps, including the annotation of constituent genomes and gene grouping based on feature/function. We are incorporating all these steps in a streamlined pangenome construction workflow to compare mycobacterial pathogenic and nonpathogenic species. The comparative pathogenomics and pangenome workflows that we develop can be easily repurposed to address several critical pathogenesis and host-specificity related questions in any bacterial species of interest, e.g., comparing Staphylococcus aureus and its MRSA/VRSA drug-resistant strains.

A Probabilistic Framework for Cell Lineage Tree Reconstruction
COSI: EvoCompGen COSI
  • Hazal Koptagel, KTH Royal Institute of Technology, Science For Life Laboratory, Sweden
  • Seong-Hwan Jun, KTH Royal Institute of Technology, Science For Life Laboratory, Sweden
  • Jens Lagergren, KTH Royal Institute of Technology, Science For Life Laboratory, Sweden

Short Abstract: Single-cell DNA sequencing (scDNA-seq) technology enables a higher-resolution view of cells and has the potential to uncover the relationships between individual cells. scDNA-seq is a fundamental tool for evolutionary studies; however, it also introduces technological artefacts such as uneven coverage, allelic dropout, and amplification and sequencing errors.

In this study, we focus on human cells without copy number alterations. We present a Bayesian model based on scDNA-seq that uncovers the differences between cells with the help of germline single nucleotide variations (gSNVs). The use of gSNVs as reference points enables us to accurately differentiate between somatic point mutations and technological artefacts, especially amplification errors. The model outputs a cell-to-cell distance matrix for each analysed pair of loci, from which we reconstruct the cell lineage tree with bootstrapping and neighbour joining. We evaluate the reconstructed tree with transfer bootstrap expectation scores of branches and the Robinson-Foulds distance to the underlying tree structure. The model is embarrassingly parallel, and with the use of dynamic programming and neighbour joining, we can analyse tens of thousands of positions in the genome.

The experiments showed high accuracy in tree reconstruction and the identification of subclones.

A Solution to the Labeled Robinson-Foulds Distance Problem
COSI: EvoCompGen COSI
  • Samuel Briand, University of Montreal, Canada
  • Nadia El-Mabrouk, University of Montreal, Canada
  • Samuel Briand, University of Lausanne, Switzerland

Short Abstract: Gene trees are extensively used, notably for inferring the most plausible scenario of evolutionary events leading to the observed gene family from a single ancestral gene copy. This has important implications towards elucidating the functional relationship between gene copies. For this purpose, reconciliation enables the labeling of internal nodes in the gene tree with the type of events at the origin of gene tree bifurcations.

The variety of phylogenetic inference methods, leading to different and potentially inconsistent trees for the same dataset, warrants the design of appropriate tools for comparing them. While comparing reconciled gene trees remains a largely unexplored field, a large variety of measures have been developed for comparing unlabeled evolutionary trees. Among them, despite its limitations, the Robinson-Foulds (RF) distance remains the most widely used one, mostly due to its computational efficiency.

In this paper, we report on a Labeled Robinson-Foulds edit distance, which maintains desirable properties such as being computable exactly in linear time. Further, we show that this new distance is computable for an arbitrary number of label types, thus making it useful for applications including more label types than speciations and duplications.
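
For readers unfamiliar with the baseline, the following sketch computes the classical, unlabeled RF distance for rooted trees by comparing their clade sets. It is not the labeled edit distance reported in this abstract, and it makes no claim to the linear-time bound; the nested-tuple tree encoding and the function names are assumptions made for the example.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaf names.

    A tree is a nested tuple of leaf-name strings, e.g. (("a", "b"), ("c", "d")).
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    found.discard(all_leaves)   # the root clade is shared by every tree on these leaves
    return found


def rf_distance(tree_a, tree_b):
    """Classical RF distance: number of clades present in exactly one of the trees."""
    return len(clades(tree_a) ^ clades(tree_b))


t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = ((("a", "c"), "b"), ("d", "e"))
print(rf_distance(t1, t2))
```

The two example trees differ only in how a, b and c are grouped, so each contributes exactly one clade that the other lacks.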

ALeS: Adaptive-length spaced-seed design
COSI: EvoCompGen COSI
  • Arnab Mallik, Western University, Canada
  • Lucian Ilie, Western University, Canada

Short Abstract: Similarity search is one of the most critical challenges in bioinformatics. It is time and memory intensive, hence heuristic methods using multiple spaced seeds are commonly employed. A spaced seed is a string of matches interspersed with don’t care positions. Sensitivity, the probability of hitting a similarity region, is used to evaluate their quality. High sensitivity increases the probability of detecting more remote homologs.

We present ALeS: Adaptive-length spaced-seed design, a software program to generate highly sensitive seeds. It uses two novel optimization techniques: indel optimization and adaptive length. In indel optimization, a random don’t care position is either inserted or deleted, following the hill-climbing approach with sensitivity as the cost function. In the adaptive length algorithm, the seed lengths are modified using previously found good seeds. It consistently outperforms all leading programs used for designing multiple spaced seeds, such as Rasbhari, AcoSeeD, SpEED, and Iedera.

ALeS will benefit numerous software programs in a variety of areas, including traditional ones, such as PatternHunter for similarity search, SHRiMP and BFAST for read mapping, and bestPrimer for designing primers, as well as emerging topics, including alignment-free comparisons and evolutionary distance programs, such as protSpaM and MultiSpaM for phylogeny reconstruction.
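
To make the definitions of a spaced seed and of sensitivity concrete, here is a small sketch that tests whether a seed hits a similarity region and computes the exact hit probability of a seed set by brute-force enumeration. It is not the ALeS algorithm: the `1`/`*` seed notation, the `1`/`0` region encoding, the independent-match model with probability `p`, and the function names are all assumptions made for illustration, and the enumeration is only practical for short regions.

```python
from itertools import product


def hits(seed, region):
    """True if the seed matches the region at some offset.

    seed:   string over {'1', '*'}; '1' requires a match, '*' is a don't care position
    region: string over {'1', '0'}; '1' is a matching alignment column, '0' a mismatch
    """
    span = len(seed)
    for start in range(len(region) - span + 1):
        if all(region[start + i] == "1" for i, s in enumerate(seed) if s == "1"):
            return True
    return False


def sensitivity(seeds, length, p):
    """Exact probability that at least one seed hits a random region.

    Each of the `length` columns is an independent match with probability p.
    All 2**length regions are enumerated, so keep `length` small.
    """
    total = 0.0
    for bits in product("01", repeat=length):
        region = "".join(bits)
        if any(hits(seed, region) for seed in seeds):
            k = region.count("1")
            total += p ** k * (1 - p) ** (length - k)
    return total


# Compare a contiguous seed with a spaced seed of the same weight (three '1's).
print(sensitivity(["111"], 12, 0.7))
print(sensitivity(["11*1"], 12, 0.7))
```

A hill-climbing designer of the kind described above would repeatedly propose a modified seed set and keep it whenever a sensitivity estimate like this one improves.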

Analysis of de novo transcriptome assemblies in sugarcane
COSI: EvoCompGen COSI
  • Verusca Semmler Rossi, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil
  • Felipe Vaz Peres, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil
  • Diego Mauricio Riaño-Pachón, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil

Short Abstract: Sugarcane (Saccharum spp.) belongs to the family Poaceae and has a large impact on Brazilian agriculture, industry and economy. Current sugarcane breeding programs aim to generate new varieties in a more targeted and faster manner. However, sugarcane has a very complex genome, due to high levels of ploidy, aneuploidy and polymorphisms among homeologous chromosomes.
The sugarcane transcriptome has been assessed, mostly using RNA-Seq, over a varied array of conditions, developmental stages and tissues, and in several different cultivars around the world. The raw sequencing reads underlying many of these sequencing datasets have been released via public repositories and can be re-used in order to generate new insights. Unfortunately, for many of these studies the transcriptome assembly is not publicly available.
We have devised a pipeline to semi-automatically generate de novo transcriptome assemblies, starting from SRA accession numbers. The pipeline includes quality checking and trimming; in particular, we have included a step to look for possible genetic material from contaminant organisms in the reads and remove it. Our pipeline also includes quality metrics with BUSCO and Transrate, where the user can select a target reference transcriptome with an available transcriptome assembly for evaluation.

APPRIS - improving principal isoforms
COSI: EvoCompGen COSI
  • Thomas Walsh, Spanish National Cancer Research Centre (CNIO), Spain
  • Fernando Pozo Ocampo, Spanish National Cancer Research Centre (CNIO), Spain
  • Jose Manuel Rodriguez, Centro Nacional de Investigaciones Cardiovasculares, Spain
  • Michael Tress, Spanish National Cancer Research Centre (CNIO), Spain

Short Abstract: The APPRIS Database (appris-tools.org) selects a single representative protein isoform for each coding gene. APPRIS principal isoforms are based on cross species conservation and the preservation of conserved protein structural and functional features. A single main splice isoform reflects the biological reality of most protein coding genes and APPRIS principal isoforms have been shown to be the isoform with the most biological relevance. The APPRIS principal isoform agrees with experimental protein evidence and expert manual annotation groups over more than 99% of measurable coding genes. In addition the exons that produce APPRIS principal isoforms are under selective pressure, unlike the vast majority of alternative isoforms.

APPRIS now includes updated principal and alternative isoforms for chicken, Danio rerio, Drosophila and C. elegans, as well as for human and several mammalian species. APPRIS principal isoforms are provided for Ensembl and RefSeq reference sets. New species can be added on request.

Improvements to the annotation system mean that APPRIS core methods are able to predict a principal isoform for more than 85% of genes. The new Trifid algorithm allows APPRIS to produce a score for the likely biological relevance of both principal and alternative isoforms.

Bioinformatics Analysis Revealed Conserved Domains in Genes Causative of Intellectual Disability
COSI: EvoCompGen COSI
  • Anna Liu, Appleby College, Canada
  • Yongsheng Bai, Next-Gen Intelligent Science Training, United States

Short Abstract: Proteins play a crucial role in biological functions and can cause diseases within a specific organ when mutations happen. Intellectual disability (ID) is a brain disorder that limits one’s intellectual abilities and adaptive behavior. In this study, we analyzed protein sequences of ID candidate genes to identify domains—distinct structural units responsible for protein interactions and functions—that are likely related to ID biogenesis.

We conducted functional annotations on 2,066 ID candidate genes gathered from published studies to identify enriched biological themes using DAVID. We selected 91 genes with the GO term ‘neuron’ and retrieved the protein sequences of the top 10 genes that were highly expressed in brain tissues based on GTEx data. Next, we ran the NCBI Batch CD-Search tool for the 9 genes with conserved domain blocks in their protein sequences and identified their super-families. Interestingly, we found two genes (GNAO1 and RAC1) belonging to the same super-family, and they interact with each other as reported by STRING.

Our bioinformatics workflow should be valuable for identifying conserved protein domains related to specific diseases. Future work could focus on determining the 3D structure of domain-containing proteins to look for functional sites that can guide targeted drug design.

Characterizing transcriptional regulatory sequences in coronaviruses and their role in recombination
COSI: EvoCompGen COSI
  • Yiyan Yang, National Library of Medicine, National Institutes of Health, United States
  • Wei Yan, National Library of Medicine, National Institutes of Health, United States
  • A. Brantley Hall, University of Maryland, United States
  • Xiaofang Jiang, NLM/NIH, United States

Short Abstract: Novel coronaviruses, including SARS-CoV-2, SARS, and MERS, often originate from recombination events. The mechanism of recombination in RNA viruses is template switching. Coronavirus transcription also involves template switching at specific regions, called transcriptional regulatory sequences (TRS). It is hypothesized but not yet verified that TRS sites are prone to recombination events. Here, we developed a tool called SuPER to systematically identify TRS in coronavirus genomes and then investigated whether recombination is more common at TRS. We ran SuPER on 506 coronavirus genomes and identified 465 TRS-L and 3509 TRS-B. We found that the TRS-L core sequence (CS) and the secondary structure of the leader sequence are generally conserved within coronavirus genera but different between genera. By examining the location of recombination breakpoints with respect to TRS-B CS, we observed that recombination hotspots are more frequently co-located with TRS-B sites than expected.

CladeOScope: elucidating functional interactions via a clade co-evolution prism
COSI: EvoCompGen COSI
  • Tomer Tsaban, Hebrew University, Israel
  • Doron Stupp, Hebrew University, Israel
  • Dana Sherill-Rofe, Hebrew University, Israel
  • Yuval Tabach, The Hebrew University of Jerusalem, Israel

Short Abstract: Mapping co-evolved genes is a powerful approach to uncover functional interactions between genes and to identify genes associated with diseases and pathways. The exponential growth in genomic data requires an updated, broader perspective to infer co-evolutionary insights. Here we suggest improving the data analysis by looking at different evolutionary scales. By analyzing 1028 genomes, divided into 66 monophyletic clades, we were able to improve the functional interaction prediction of human genes based on unique, local co-evolution patterns of different clades. While using clades was previously suggested to be insightful, no wide and thorough clade analysis has been performed. To address these issues systematically, we exhaustively analyzed clade signals and developed the “CladeOScope” approach. We show that CladeOScope outperforms other phylogenetic profiling approaches. As an example, we demonstrate that the non-homologous end joining (NHEJ) and UFM1 ubiquitin-like protein pathways can be detected accurately, whereas their detection is less than optimal with standard phylogenetic profiling methods.
Our work suggests an essential, systemic improvement to phylogenetic profiling analysis. We believe this approach will dramatically improve functional interaction prediction and our biological understanding of pathways. CladeOScope is freely available at cladeoscope.cs.huji.ac.il.
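
The idea of reading co-evolution at the clade level can be sketched as follows: correlate two genes' presence/absence profiles separately within each clade rather than only across all species at once. This is a simplified illustration, not the CladeOScope scoring scheme; the binary profiles, the use of Pearson correlation, the clade names and the function names are assumptions made for the example.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def clade_scores(profile_a, profile_b, clade_of):
    """Correlate two genes within each clade.

    profile_a, profile_b: dict species -> 0/1 (ortholog absent/present)
    clade_of:             dict species -> clade name
    """
    by_clade = {}
    for species, clade in clade_of.items():
        by_clade.setdefault(clade, []).append(species)
    return {
        clade: pearson([profile_a[s] for s in members],
                       [profile_b[s] for s in members])
        for clade, members in by_clade.items()
    }


# Hypothetical toy data: eight species in two clades.
species = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
clade_of = dict(zip(species, ["X"] * 4 + ["Y"] * 4))
gene_a = dict(zip(species, [1, 0, 1, 0, 1, 1, 0, 0]))
gene_b = dict(zip(species, [1, 0, 1, 0, 0, 0, 1, 1]))

overall = pearson([gene_a[s] for s in species], [gene_b[s] for s in species])
print(overall, clade_scores(gene_a, gene_b, clade_of))
```

In this toy case the two genes are perfectly co-distributed in one clade and perfectly anti-distributed in the other, so the signal cancels out in the all-species correlation but is visible clade by clade.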

Clonal genotype and evolutionary structure inference based on Robust PCA method for scDNA-seq data
COSI: EvoCompGen COSI
  • Ziwei Chen, Academy of Mathematics and Systems Sciences, Chinese Academy of Sciences, China
  • Lin Wan, Academy of Mathematics and Systems Sciences, Chinese Academy of Sciences, China

Short Abstract: The rapid advances in single-cell DNA sequencing (scDNA-seq) technology provide unprecedented insights into the evolutionary mechanisms underlying cancer progression and help characterize intratumoral heterogeneity. However, error-prone scDNA-seq data with high experimental noise make downstream computational analysis challenging. Here, we present a novel algorithm based on a low-rank matrix decomposition method, the robust principal component analysis (RPCA) model, to recover subclonal genotypes from the observed genotype matrix (GTM) of either scSNV or scCNV data, and to reconstruct the subclonal evolutionary tree. The RPCA model is a natural fit for the GTM recovery problem, and the extended RPCA method handles GTMs with missing entries. We demonstrate the power of the algorithm in recovering the true GTM and inferring subclonal evolutionary trees under various scenarios, using simulated and real data. We also show the efficiency of the algorithm in applications to large-scale data.

CloudForest: An Integrated and Dynamic Phylogenomic Toolset for the Modern Age
COSI: EvoCompGen COSI
  • Benjamin S. Toups, Department of Biological Sciences, Louisiana State University, United States
  • Thomas McGowan, Minnesota Supercomputing Institute, University of Minnesota, United States
  • Jeremy M. Brown, Department of Biological Sciences and Museum of Natural Science, Louisiana State University, United States
  • Kyle A. Gallivan, Department of Mathematics, Florida State University, United States
  • James C. Wilgenbusch, Minnesota Supercomputing Institute, University of Minnesota, United States

Short Abstract: Variation across inferred gene trees is arguably the most consistent and striking observation from empirical phylogenomic studies, yet many unanswered questions remain about the causes of this variation. One important reason these questions persist is because the field lacks robust, efficient, and reproducible workflows for investigating this variation. To meet this need, we are developing a portable cyberinfrastructure framework called CloudForest. CloudForest will provide researchers with a set of streamlined and integrated tools to explore the structure of large phylogenetic tree sets, including those generated by phylogenomic studies. CloudForest will meet many of the outstanding needs of empirical phylogenomic studies, such as (1) visualizing variation across gene trees, (2) revealing structure in sets of trees (forests), (3) conducting hypothesis tests regarding the causes of gene-tree variation, and (4) detecting genes that may have outlying (and potentially aberrant) histories. By addressing these challenges in a consistent way across computing platforms that range from a desktop computer to university high-performance computing systems to commercial cloud computing, CloudForest will allow biologists to make efficient use of any computational resources at their disposal with workflows appropriate for addressing a variety of important, unresolved questions in both evolutionary biology and other applied fields.

Comparative analysis of Lysine and Arginine biosynthesis pathway in Deinococcus genomes
COSI: EvoCompGen COSI
  • Sankar Mahesh, SASTRA Deemed to be University, India
  • Richa Priyadarshini, Shiv Nadar University, India
  • Ragothaman Yennamalli, SASTRA Deemed to be University, India

Short Abstract: Deinococcus indicus is an arsenic-tolerant bacterium isolated from wetlands of North India. D. indicus exhibits growth media-induced morphological changes, including cell elongation. It is known that ornithine present in the bacterial cell wall is derived from the L-lysine and L-arginine biosynthetic pathways. Across 23 Deinococcus genomes, including D. indicus, we performed BLAST-P based ortholog identification using D. radiodurans’ genes as queries. We identified some BLAST-P hits that had <60% sequence identity and <60% query coverage while sharing the same functional annotation. We analyzed three proteins (class I aminotransferase, acetyl-lysine deacetylase, and acetylglutamate/acetylaminoadipate kinase) from the L-lysine biosynthesis pathway and three (bifunctional ornithine acetyltransferase or N-acetyl glutamate synthase protein, nitric oxide synthase-like protein, and acetyl-lysine deacetylase) from the L-arginine biosynthesis pathway. Two proteins showed certain structural variations. Specifically, sequence- and structure-level changes in the [LysW]-lysine hydrolase protein indicate changes in oligomeric conformation, which could be a result of divergent evolution. The other protein (bifunctional ornithine acetyltransferase or N-acetyl glutamate synthase) had its active site pocket positions shifted at the structural level, and we hypothesize that it may not perform at the optimal level. Thus, we were able to compare and contrast different Deinococcus species, indicating that some genes arose as a result of divergent evolution.

Comparative CAZyme Gene Cluster Analysis Reveals Immense Diversity of Bacterial Glycan Metabolism Systems
COSI: EvoCompGen COSI
  • Catherine Ausland, Department of Biological Sciences, Northern Illinois University, United States
  • Yanbin Yin, Department of Food Science and Technology, Nebraska Food for Health Center, University of Nebraska-Lincoln, United States

Short Abstract: Carbohydrate active enzymes (CAZymes) degrade and synthesize glycans in all organisms on earth. CAZymes work in gene clusters with TonB dependent transporters and starch-binding proteins (“SusCD-like” pair) in Bacteroidetes to degrade glycans and are highly studied because of their contribution to glycan metabolism in gut microbiomes, yet other phyla likely have evolved similar clusters. We propose CAZyme Gene Clusters (CGCs) to generalize across all phyla for novel glycan-substrate gene clusters. Using dbCAN2 and CGC-Finder tools with benchmarked parameters, we identified over 2.5 million CGCs in 74,008 bacterial genomes and performed comparative analyses across phyla. Less than 1% of predicted CGCs contained SusCD-like gene pairs, illustrating the potential diversity of CGCs may be much greater than those studied in Bacteroidetes. CAZyme and transporter composition of CGCs across major gut phyla differed; more glycan-degrading CAZymes in Bacteroidetes, versus more glycan-synthesizing CAZymes in other phyla, perhaps suggesting undiscovered CGCs responsible for synthesis of microbially-produced glycans that may elicit inflammation and metabolic disorders. Predicting CGCs provides informed starting points for novel glycan-systems discovery and augments study of microbes and their effect on host health. Predicted CGCs will be incorporated into our database dbCAN-seq, allowing users to use CGC data for experimental study.

Comparative genomics of Bacillus anthracis, B. cereus and B. thuringiensis using orthogroups and pathogenic islands improved the classification of these closely evolving species.
COSI: EvoCompGen COSI
  • Thifany Vilela Purcena, Centro Universitario de Brasilia, Brazil
  • Roberto Coiti Togawa, Embrapa Genetic Resources and Biotechnology, Brazil
  • Rose Gomes Monnerat, Embrapa Genetic Resources and Biotechnology, Brazil
  • Paulo Roberto Martins Queiroz, Centro Universitario de Brasilia, Brazil
  • Priscila Grynberg, Embrapa Genetic Resources and Biotechnology, Brazil

Short Abstract: The Bacillus cereus sensu lato group comprises twelve species of great medical, economic and biodefense importance. The most important are B. anthracis, the etiologic agent of anthrax; B. cereus, which causes gastrointestinal infections; and B. thuringiensis, a natural species-specific insecticide through the production of toxic crystals (Cry). They are Gram-positive, spore-forming, aerobic or facultatively anaerobic bacteria and are found naturally in the soil. These three species are genetically very similar to each other, making classification based on molecular methods a difficult task. A total of 268,898 proteins from 45 local and public proteomes, including an outgroup, were uploaded as input for the OrthoFinder software. A total of 11,839 orthogroups comprising 98.9% of the proteins were generated. Only 516 proteins were allocated to 198 species-specific orthogroups. RAxML was applied for phylogenomic analysis. Results were visualized using iTOL. This approach successfully placed B. anthracis strains in a different node from the others. B. cereus and B. thuringiensis share the same node, organized in small species-specific groups. We also used IslandViewer to explore the pathogenic islands in an attempt to complement the molecular information about these bacteria. This approach has shown promise in correctly classifying these species.

Comparison of mitochondrial genetics of domestic and sylvatic Aedes aegypti strains in East Africa to strictly anthropophilic strains in other continents
COSI: EvoCompGen COSI
  • Brian Bwanya, International Centre of Insect Physiology and Ecology, Kenya
  • David Mburu, Pwani University, Kenya
  • David Tchouassi, International Centre of Insect Physiology and Ecology, Kenya

Short Abstract: Advances in genotyping methods have shed more light on mosquito genetics and improved our understanding of vector-borne disease transmission cycles. Outside of Africa, Aedes aegypti, the main vector of re-emerging arboviruses such as dengue, chikungunya, yellow fever, and Zika, exhibits a mainly anthropophilic blood-feeding behaviour, resting and breeding in close association with human settlements. However, in its native African ecology, both domestic (Aedes aegypti aegypti) and sylvatic (Aedes aegypti formosus) lineages occur. Further, the African Ae. aegypti populations have been found to exhibit divergence in typically conserved mitochondrial cytochrome c oxidase subunit 1 (COI) genes and in traits of epidemiological importance, including foraging, oviposition, and resting behavior. Such findings raise important questions regarding the genetic diversity and evolutionary history of Ae. aegypti. This study aims to develop a bioinformatics workflow to characterize mitochondrial genetic divergence between Ae. aegypti populations within East Africa and in relation to those available in public databases of domesticated lineages in the Americas and Asia. We will compare mitochondrial sequences of Ae. aegypti sampled near human habitation to those sampled in sylvatic settings. This study will shed light on factors associated with the population biology of Ae. aegypti that may impact vector-borne disease transmission dynamics and risks.

Determining fine-scale temporal variation patterns in evolving populations using a non-parametric test
COSI: EvoCompGen COSI
  • Dongpin Oh, School of Computer Science and Engineering, Pusan National University, South Korea
  • Giltae Song, Pusan National University, South Korea

Short Abstract: Abnormal variations are frequent in the clonal genome evolution of cancers. Such aberrant variations often function as drivers of cancer cell growth. The fundamental evolutionary dynamics underlying these variations in tumor metastasis remain understudied owing to their genetic complexity. Whole-genome sequencing makes it possible to determine genome variations in the short-term evolution of cell populations. This approach has been applied to evolving populations of unicellular organisms, including yeast. Examining sequence changes at such fine-scale resolution is substantial progress in evolutionary genomics. However, existing statistical tests for analyzing temporal changes in variation across multiple time-points are limited in their ability to identify the full spectrum of intermediate changes.
We designed a non-parametric statistical approach inspired by the Kolmogorov-Smirnov test using Monte-Carlo simulation and integrated this into a software tool for determining the significant variation patterns at fine-scale temporal resolution in experimental evolution studies using whole-genome sequencing. We validated our method and compared it with two existing methods, the CMH (Cochran-Mantel-Haenszel) and BBGP (Beta-Binomial–Gaussian-Process) tests, using in vitro and in silico data of fruit fly short-term evolution. We analyzed S. cerevisiae W303 strain genomes sequenced at 12 time-points over 1000 generations using our software tool and identified novel variations having interesting patterns related to reference genes.
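
The abstract does not spell out the test, so the sketch below only shows one generic way to combine its two named ingredients: a Kolmogorov-Smirnov statistic and a Monte Carlo (permutation) estimate of its p-value. It should not be read as the authors' method; the two-sample setup, the function names, the number of rounds and the toy allele-frequency values are assumptions made for the example.

```python
import random


def ks_statistic(xs, ys):
    """Largest gap between the empirical CDFs of two samples."""
    gap = 0.0
    for t in sorted(set(xs) | set(ys)):
        fx = sum(x <= t for x in xs) / len(xs)
        fy = sum(y <= t for y in ys) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap


def permutation_p_value(xs, ys, rounds=2000, seed=0):
    """Monte Carlo p-value: how often shuffled labels give a gap at least as large."""
    rng = random.Random(seed)
    observed = ks_statistic(xs, ys)
    pooled = list(xs) + list(ys)
    extreme = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            extreme += 1
    return (extreme + 1) / (rounds + 1)   # add-one correction keeps p above zero


# Hypothetical allele frequencies of one variant at early and late time-points.
early = [0.02, 0.03, 0.05, 0.04, 0.06, 0.03, 0.05, 0.04, 0.02, 0.06]
late = [0.41, 0.52, 0.48, 0.60, 0.55, 0.47, 0.58, 0.50, 0.44, 0.62]
print(ks_statistic(early, late), permutation_p_value(early, late))
```

Because the null distribution is built by reshuffling the observed values, no distributional form has to be assumed for the frequencies.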

DYNAMIC IDENTIFICATION OF VIRAL TRANSMISSION EPICENTERS (DYNAMITE)
COSI: EvoCompGen COSI
  • Brittany Rife Magalis, University of Florida, United States
  • Marco Salemi, University of Florida, United States
  • Simone Marini, University of Florida, United States
  • Mattia Prosperi, University of Florida, United States

Short Abstract: Molecular data analysis is invaluable in understanding the overall behavior of a rapidly spreading virus population when epidemiologic surveillance is problematic. However, no method exists to date to identify and characterize the dynamics of individual transmission clusters within an outbreak to support more targeted (i.e., epicenter-focused) approaches to extinguishing spread. For this purpose, we have developed a phylogeny-based tool (DYNAMITE) that infers putative transmission networks, is applicable to all measurably evolving viruses, and utilizes tree-based models to quantify epidemiologically relevant growth characteristics over time, such as the effective reproductive number. DYNAMITE uses lineages through time to estimate the exponential growth phase of the underlying epidemic, during which the estimated population history is representative of the number of infected individuals, and the corresponding genetic distances are representative of the standard mutation accumulation during the transmission period. The maximum genetic distance during this time is used as the threshold for classifying internal nodes as unsampled individuals throughout the remainder of the phylogeny, and thus for defining a cluster, beginning at well-supported nodes (e.g., using bootstrapping). This novel algorithmic approach is to be validated using simulation of minimally exchangeable sub-populations within an SIR model of a growing epidemic.

Gaps and runs in syntenic alignments
COSI: EvoCompGen COSI
  • Zhe Yu, University of Ottawa, Canada
  • Chunfang Zheng, University of Ottawa, Canada
  • David Sankoff, University of Ottawa, Canada

Short Abstract: Gene loss is the obverse of novel gene acquisition by a genome through a variety of evolutionary processes. It serves a number of functional and structural roles, compensating for the energy and material costs of gene complement expansion.

A type of gene loss widespread in the lineages of plant genomes is “fractionation” after whole genome doubling or tripling, where one of a pair or triplet of paralogous genes in parallel syntenic contexts is discarded. The detailed syntenic mechanisms of gene loss, especially in fractionation, remain controversial.

We focus on the frequency distribution of gap lengths (number of deleted genes, not nucleotides) within syntenic blocks calculated during the comparison of chromosomes from two genomes. We mathematically characterize a simple model in some detail and show that it adequately describes neither the Coffea arabica subgenomes nor its two progenitor genomes. We find that a mixture of two models, a random one-gene-at-a-time deletion model and an excision model that removes a variable, geometrically distributed number of genes, fits well.
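
To make the quantity being modelled concrete, here is a minimal sketch of how a gap-length frequency distribution could be tabulated from one syntenic block. The input encoding (a list of booleans, `True` meaning the gene is still present) and the function name are assumptions for illustration. Deleted runs at the block ends are ignored, on the assumption that a gap inside a block is flanked by retained genes.

```python
from collections import Counter
from itertools import groupby


def gap_length_distribution(retained):
    """Count runs of deleted genes (False) flanked by retained genes (True)."""
    flags = list(retained)
    # Trim deleted genes at either end: they are not interior gaps.
    while flags and not flags[0]:
        flags.pop(0)
    while flags and not flags[-1]:
        flags.pop()
    return Counter(
        sum(1 for _ in run)
        for present, run in groupby(flags)
        if not present
    )


if __name__ == "__main__":
    block = [True, False, True, False, False, True, True, False, False, True]
    print(gap_length_distribution(block))  # one gap of length 1, two of length 2
```

A fitted model, whether single or mixture, would then be compared against counts of this kind.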

Gene annotation refinement software using synteny based mapping
COSI: EvoCompGen COSI
  • Hoyong Lee, Pusan National University, South Korea
  • Giltae Song, Pusan National University, South Korea

Short Abstract: High-throughput next-generation sequencing (NGS) substantially reduces the cost of generating genome data. To apply NGS data to various genetics studies, the sequencing data are assembled into a genome sequence and annotated with genes. Gene annotation, one of the fundamental steps in understanding the function of each gene, determines the location of genes and coding regions. Several gene annotation tools exist, but they still have limitations arising from ambiguity in the annotation steps. Most annotation tools are also quite difficult for novice users to install and apply to their studies.
We propose a user-friendly and practically usable gene annotation software pipeline. To this end, the ambiguity problems are resolved using synteny mapping information. The performance of our software tool is evaluated using benchmark datasets such as the sequence of the Saccharomyces cerevisiae S288C strain as well as data from other strains. We believe our tool improves the accuracy of gene annotations so that it can substantially reduce the effort and time required for manual curation in genome annotation. Our software package is released as an installation script and a Docker image so that users can easily install it and apply it to their own sequence data.

Graph Splitting: A Graph-Based Approach for Superfamily-Scale Phylogenetic Tree Reconstruction
COSI: EvoCompGen COSI
  • Motomu Matsui, The University of Tokyo, Japan
  • Wataru Iwasaki, The University of Tokyo, Japan

Short Abstract: A protein superfamily contains distantly related proteins that have acquired diverse biological functions through a long evolutionary history. Phylogenetic analysis of the early evolution of protein superfamilies is a key challenge because existing phylogenetic methods show poor performance when protein sequences are too diverged to construct an informative multiple sequence alignment (MSA). Here, we propose the Graph Splitting (GS) method, which rapidly reconstructs a protein superfamily-scale phylogenetic tree using a graph-based approach. Evolutionary simulation showed that the GS method can accurately reconstruct phylogenetic trees and be robust to major problems in phylogenetic estimation, such as biased taxon sampling, heterogeneous evolutionary rates, and long-branch attraction when sequences are substantially diverged. Its application to an empirical data set of the triosephosphate isomerase (TIM)-barrel superfamily suggests rapid evolution of protein-mediated pyrimidine biosynthesis, likely taking place after the RNA world. Furthermore, the GS method can also substantially improve performance of widely used MSA methods by providing accurate guide trees. The GS method is freely available at our website: gs.bs.s.u-tokyo.ac.jp/

Reference:
Motomu Matsui and Wataru Iwasaki. Systematic Biology, 69, 265-279. (2020)
doi.org/10.1093/sysbio/syz049

Highly-regulated and diverse NTP-based biological conflict systems with implications for emergence of multicellularity
COSI: EvoCompGen COSI
  • Gurmeet Kaur, NCBI, NIH, United States
  • A Maxwell Burroughs, NCBI, NIH, United States
  • Lakshminarayan M Iyer, NCBI, NIH, United States
  • Aravind L., NCBI, NIH, United States

Short Abstract: Multicellular organizations are prone to infections even if a single cell is infected. We reveal novel highly-regulated chaperone-based systems that are likely used as a survival tactic by prokaryotes with complex lifecycles. These architecturally-analogous systems have constant core modules coupled with highly-variable effector modules, which is reminiscent of known biological conflict systems and co-evolutionary arms races. The constant component is either an ATPase/GTPase and/or a peptidase that is activated in response to an invasive entity and causes effector deployment, which is additionally regulated by proteolytic processing or binding of a nucleotide-derived signal. A third component senses invasive entities and transmits the signal. Effectors either: target invasive nucleic-acids or proteins; are inactive counterparts of host proteins that mediate decoy interactions with invasive molecules; or form macromolecular assemblages to cause host cell-death or containment of the invasive entity. These apoptotic and immunity properties displayed by systems in phylogenetically-disparate multicellular prokaryotes are suggestive of evolutionary convergence for kin viability in multicellular organizations. Comparable protein domains appear to have organized into systems based on common principles in eukaryotic apoptosis. Thus, a similar operational “grammar” and shared “vocabulary” of protein domains in sensing and limiting infections is seen during the multiple emergences of multicellularity across the tree of life.

Identifying new photosynthesis genes by massive comparative genomics in Arabidopsis thaliana
COSI: EvoCompGen COSI
  • Elad Sharon, The Hebrew University of Jerusalem, Israel
  • Alexander Vainstein, The Hebrew University of Jerusalem, Israel
  • Yuval Tabach, The Hebrew University of Jerusalem, Israel

Short Abstract: Photosynthesis is a highly conserved process by which plants and other photosynthetic organisms turn solar energy, water, and carbon dioxide into yield and oxygen. This process is of key interest for increasing yield production globally, which marks improving photosynthesis efficiency as a promising target. Here, we use the phylogenetic profiling (PP) method to point to new potential genes involved in photosynthesis. PP is based on the correlated occurrence and absence patterns of genes along evolution, which tend to reflect their mutual function in biological processes or complexes. The proteome of Arabidopsis thaliana was clustered according to its evolutionary pattern in 1671 eukaryotes, of which 170 are green plants. The main components of photosynthesis were found in gene clusters with a specific evolutionary pattern, represented mostly by photosynthetic organisms. These clusters are significantly enriched with the main components, regulation and assembly of the photosystem machinery, the components of the carbon fixation process and other known photosynthesis-related processes. Importantly, these clusters also contained unannotated genes. Overall, our computational evolutionary analysis suggests additional genes in the photosynthesis-related processes. The identification of these genes might open new ways to improve photosynthesis efficiency and so increase the global yield.
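
The core of phylogenetic profiling, correlating presence/absence patterns of genes across species, can be sketched as follows. This is a toy illustration only: binary profiles and a plain Pearson correlation are assumptions made here, and the abstract does not state which similarity measure or clustering procedure the authors used.

```python
from math import sqrt


def profile_correlation(p, q):
    """Pearson correlation between two presence/absence profiles (1 = present)."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same species")
    n = len(p)
    mean_p = sum(p) / n
    mean_q = sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        # A gene present (or absent) everywhere carries no co-occurrence signal.
        return 0.0
    return cov / sqrt(var_p * var_q)


if __name__ == "__main__":
    # Hypothetical profiles across six species.
    known_gene = [1, 1, 0, 0, 1, 0]
    candidate = [1, 1, 0, 0, 1, 0]
    unrelated = [0, 0, 1, 1, 0, 1]
    print(profile_correlation(known_gene, candidate))
    print(profile_correlation(known_gene, unrelated))
```

Genes whose profiles correlate strongly with known photosynthesis genes would be the candidates of interest in an analysis of this kind.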

iJump: a fast tool for tracking bacterial mobile elements rearrangements in course of adaptive laboratory evolution
COSI: EvoCompGen COSI
  • Semen Leyn, Sanford Burnham Prebys Medical Discovery Institute, United States

Short Abstract: Mobile elements rearrangements in bacteria may lead to gene inactivation or deregulation providing an important contribution to adaptation. While the challenge of mapping these rearrangements was addressed for individual genomes, no efficient tools are available for tracking their dynamics in evolving populations, such as in adaptive laboratory evolution (ALE).
We are using ALE in a custom-engineered continuous culture device (morbidostat) to study dynamics and mechanisms of antibiotic resistance in major gram-negative bacterial pathogens. Acquisition of mutations in evolving populations is monitored by deep sequencing of samples in time-series.
To observe evolutionary paths driven by “jumping” of IS elements we have developed the iJump software, which uses soft-clipped reads from the SAM/BAM alignment extracted from the boundaries of known mobile elements to find new junctions and estimate their frequencies. The performance of iJump was first tested on a simulated data set where it showed 1-4% error in frequency estimation. Application of the iJump tool to our ALE studies with Escherichia coli, Acinetobacter baumannii and Pseudomonas aeruginosa confirmed its practical utility and revealed IS-driven bacterial adaptations to known antibiotics and novel drug candidates. The results were verified by Nanopore-based sequencing and MIC determination of selected individual clones. Software available at github.com/sleyn/ijump
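
For readers unfamiliar with soft-clipping: in a SAM record, the CIGAR string marks read ends that did not align with the `S` operation, and such unaligned ends are what can reveal a junction at a mobile element boundary. The sketch below extracts the soft-clip lengths from a CIGAR string. It illustrates the SAM convention only and is not taken from the iJump code; the function name is hypothetical.

```python
import re

# CIGAR operations defined in the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clipped base counts for a CIGAR string."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) may sit outside soft clips, so drop them first.
    core = [item for item in ops if item[1] != "H"]
    left = core[0][0] if core and core[0][1] == "S" else 0
    right = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return left, right


if __name__ == "__main__":
    for cigar in ("100M", "5S95M", "80M20S", "3H5S90M"):
        print(cigar, soft_clip_lengths(cigar))
```

A read with a non-zero clip next to an annotated element boundary is the kind of evidence a junction finder would collect and count.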

In silico analysis of sugarcane genes involved in transcriptional regulation and the metabolism of carbohydrates
COSI: EvoCompGen COSI
  • Verusca Semmler Rossi, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil
  • Felipe Vaz Peres, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil
  • Diego Mauricio Riaño-Pachón, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil

Short Abstract: Modern sugarcane cultivars (Saccharum spontaneum x Saccharum officinarum) produce 80% of the world's sugar. Sugarcane is one of the main tropical crops and a highly efficient producer of biomass in the form of sucrose and fiber, which can be turned into ethanol and electricity, respectively. The sugarcane genome is very complex, with a monoploid genome of 900 Mbp that is highly polymorphic and shows high levels of ploidy as well as aneuploidy. Advances in sequencing technologies have allowed the generation of two draft genomes, still highly fragmented, of the Brazilian commercial genotype SP80-3280, which have not been compared and integrated to date. We have identified the putative full and non-redundant set of transcripts coding for Transcription Associated Proteins (TAPs) and Carbohydrate Active Enzymes (CAZymes) in the genome of this cultivar, exploiting the two available draft genome sequences. We are further exploiting public RNASeq datasets from sugarcane to identify clusters of co-expressed genes involving TAPs and CAZymes. So far, we have divided TAPs into Transcription Factor (TF) families and Other Transcriptional Regulators (OTRs), and found 99 families of TFs, represented by 10,724 genes, and 96 families of OTRs, represented by 2,709 genes. Similarly, we identified 503 families of CAZymes represented by 28,881 genes.

Integrated synteny- and similarity-based inference on the polyploidization-fractionation cycle
COSI: EvoCompGen COSI
  • Yue Zhang, University of Ottawa, Canada
  • Zhe Yu, University of Ottawa, Canada
  • Chunfang Zheng, University of Ottawa, Canada
  • David Sankoff, University of Ottawa, Canada

Short Abstract: Two orthogonal approaches to the study of fractionation (duplicate gene loss after polyploidization) focus on the decrease over time of the number of surviving duplicate pairs, on the one hand, and on the pattern of syntenically consecutive pairs lost at a deletion event, on the other. Here we explore a synergy between the two approaches that greatly enlarges the scope of both.

In the branching process approach to accounting for the distribution of gene pair similarities, the inference possibilities are minimal, since there is only one degree of freedom for each replication event. It is only by transcending the distribution of gene pair similarities and bringing other data to bear that we can increase the number of parameters of the branching process that can be estimated.

We greatly enlarged the possibilities of estimating the parameters of this model of the replication-fractionation cycle by considering the singletons within synteny blocks, by deriving theoretical constraints among the retention rates, and by correcting for erosion of synteny blocks over time.

Joint clustering of single cell sequencing and fluorescent in situ hybridization data to infer tumor copy number phylogenies
COSI: EvoCompGen COSI
  • Xuecong Fu, Carnegie Mellon University, United States
  • Haoyun Lei, Carnegie Mellon University, United States
  • Russell Schwartz, Carnegie Mellon University, United States

Short Abstract: Aneuploidy, and associated whole genome duplication (WGD) events, are common features of cancers associated with poor outcomes. Phylogenetic methods for reconstructing clonal evolution from genomic data, though, so far have limited ability to resolve tumor evolution via ploidy changes. This occurs in part because single cell DNA-sequencing (scSeq), which has been crucial to developing detailed profiles of clonal evolution, is poorly suited to studying ploidy changes and WGD. Multiplex interphase fluorescence in situ hybridization (miFISH) provides a more unambiguous signal of single-cell ploidy changes but is limited to profiling small numbers of single markers. Here, we develop a joint clustering method to combine these two data sources. We develop a probabilistic framework to maximize the probability of latent variables given the pre-clustered datasets, which we optimize via Markov chain Monte Carlo sampling combined with linear regression. After subclonal profiles for scSeq and the ploidy information are identified, we build multiple trees from subsets of scSeq features by FISHtrees, a phylogeny algorithm for ploidy-aware phylogenetics from small numbers of FISH probes, from which we derive a consensus tree. We validate and demonstrate the method using real and semi-simulated data derived from two glioblastoma cases profiled by both scSeq and miFISH.

Modeling gene expression evolution with EvoGeneX uncovers differences in evolution of species, organs and sexes
COSI: EvoCompGen COSI
  • Soumitra Pal, National Center for Biotechnology Information, National Library of Medicine, National Institute of Health, United States
  • Brian Oliver, Laboratory of Cellular and Developmental Biology, National Institute of Diabetes and Digestive and Kidney Diseases, United States
  • Teresa Przytycka, National Center for Biotechnology Information, National Library of Medicine, National Institute of Health, United States

Short Abstract: While DNA sequence evolution is well-studied, an equally important factor, evolution of gene expression, is yet to be fully understood. The availability of recent tissue/organ-specific expression datasets spanning several organisms across the tree of life, including our new data from Drosophila, has enabled detailed studies of expression evolution.

We introduce EvoGeneX, a computational method that complements existing models for expression evolution across species using stochastic processes, maximum-likelihood estimation and hypothesis testing to differentiate three modes of evolution: 1) neutral: Brownian Motion, 2) constrained: when expression evolved toward an optimum (Ornstein-Uhlenbeck process), and 3) adaptive: when expression in different branches of the species tree evolved toward different optima. Additionally, EvoGeneX incorporates biological replicates to account for within-species variation. We also introduce a novel comparative analysis of evolution across tissues and sexes using Michaelis-Menten (MM) curves.

In our simulation EvoGeneX significantly outperformed the currently available method on false discovery rate. On expression data across organs, species, and sexes of Drosophila, our generic method revealed a large fraction of constrained genes including genes constrained in all organs and sexes. Our MM-based approach revealed striking differences in evolutionary dynamics in gonads. Finally, EvoGeneX revealed compelling examples of adaptive evolution, including odor binding proteins, ribosomal proteins, and amino acid metabolism.
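
For orientation, the neutral and constrained modes named in this abstract are conventionally written as the stochastic differential equations below. This is the textbook form and not necessarily the exact parameterization EvoGeneX uses. The symbols are notation assumed here: X(t) is the expression level along a branch, W(t) is standard Brownian motion, σ is the drift strength, α is the strength of selection, and θ is the optimum.

```latex
\begin{align}
\text{neutral (Brownian motion):} \quad & dX(t) = \sigma \, dW(t) \\
\text{constrained (Ornstein-Uhlenbeck):} \quad & dX(t) = \alpha \bigl(\theta - X(t)\bigr) \, dt + \sigma \, dW(t)
\end{align}
```

Under this notation the adaptive mode corresponds to letting θ take different values on different branches of the species tree, and setting α = 0 reduces the Ornstein-Uhlenbeck process to Brownian motion.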

Occurrence of XPD helicase in Galliform birds
COSI: EvoCompGen COSI
  • Rayana Feltrin, Universidade Federal de Santa Maria, Brazil
  • Ana Segatto, Instituto Federal do Rio Grande do Sul, Brazil
  • Tiago de Souza, TauGC Bioinformatics, Brazil
  • André Schuch, Universidade Federal de Santa Maria, Brazil

Short Abstract: The nucleotide excision repair (NER) pathway is the most versatile DNA repair mechanism, as it removes a wide variety of structurally unrelated DNA lesions. Among the main NER components, the XPD helicase, which is part of transcription factor IIH (TFIIH), is one of the most evolutionarily conserved proteins, being present even in Archaea. However, according to our previous work, a canonical XPD ortholog is missing in Gallus gallus. To better investigate this, we performed a refined search of XPD in G. gallus and also searched for XPD orthologs in genomes of Galliformes, Tinamiformes and Struthioniformes by using similarity and structural criteria. As a result, we found that the protein DDX11 may be replacing XPD function in chicken. We also identified likely occurrences of XPD in only three genomes out of 19, belonging to species of Tinamiformes and Struthioniformes, that is, the base of the bird phylogeny. In addition, we obtained search results with high sequence identities, but very low coverages. Thus, we suppose that there might have been a progressive loss of the XPD sequence from the base of the bird phylogeny throughout the evolution of Galliformes until chicken, which reinforces the importance of in silico studies to open perspectives for functional investigations.

On quantifying evolutionary importance of protein sites: A tale of two measures
COSI: EvoCompGen COSI
  • Avital Sharir-Ivry, McGill University, Israel
  • Yu Xia, McGill University, Canada

Short Abstract: A key challenge in evolutionary biology is the quantification of selective pressure on proteins and other biological macromolecules at single-site resolution. The evolutionary importance of a protein site under purifying selection is typically measured by the degree of conservation of the protein site itself. A possible alternative measure is the strength of the site-induced conservation gradient in the rest of the protein structure. Here, we show that despite major differences, there is a linear relationship between the two measures such that more conserved protein sites also induce stronger conservation gradient in the protein structure. This linear relationship is universal as it holds for different types of proteins and functional sites. Our results show that generally, the selective pressure acting on a functional site percolates through the rest of the protein via residue-residue contacts. Surprisingly however, catalytic sites in enzymes are the principal exception to this rule. Catalytic sites induce significantly stronger conservation gradients in the rest of the protein than expected from the degree of conservation of the site alone. The uniquely stringent requirement for the active site to selectively stabilize the transition state of the catalyzed chemical reaction imposes additional selective constraints on the rest of the enzyme.

Reconstructing Tumor Evolutionary Histories and Clone Trees in Polynomial-time with SubMARine
COSI: EvoCompGen COSI
  • Linda K. Sundermann, University of Toronto, Canada
  • Jeff Wintersinger, University of Toronto, Canada
  • Jens Stoye, Bielefeld University, Germany
  • Quaid Morris, Memorial Sloan Kettering Cancer Centre, United States
  • Gunnar Rätsch, ETH Zürich, Switzerland

Short Abstract: Tumors contain multiple subpopulations of genetically distinct cancer cells. Reconstructing their evolutionary history can improve our understanding of how cancers develop and respond to treatment. Subclonal reconstruction methods infer the ancestral relationships among the subpopulations by constructing a clone tree. However, often multiple clone trees are consistent with the data. Current methods do not effectively characterize this uncertainty, and cannot scale to cancers with many subclonal populations.

In this work we introduce a partial clone tree that defines a subset of the pairwise ancestral relationships in a clone tree, thereby implicitly representing the set of all clone trees that have these defined relationships. Also, we define a special partial clone tree, the Maximally-Constrained Ancestral Reconstruction (MAR), which summarizes all clone trees fitting the input data equally well. We describe SubMARine, a polynomial-time algorithm producing the subMAR, which approximates the MAR with specific guarantees. We also extend SubMARine to work with subclonal copy number aberrations.

We show, both on simulated and a real lung cancer dataset, that SubMARine runs in less than 70 seconds, and that the subMAR equals the MAR in > 99.9% of cases where only a single tree exists.

SubMARine is available at github.com/morrislab/submarine.

Reference genome sequence-based read clustering
COSI: EvoCompGen COSI
  • Mikang Sim, Konkuk University, South Korea
  • Jongin Lee, Konkuk University, South Korea
  • Daehong Kwon, Konkuk University, South Korea
  • Daehwan Lee, Konkuk University, South Korea
  • Jaebum Kim, Konkuk University, South Korea

Short Abstract: There have been many studies on developing clustering algorithms to aid the analysis of different types of data. Next-generation sequencing data, called reads, are a good target for clustering because accurately constructed read clusters can dramatically improve the quality of downstream analyses, such as genome assembly. In addition, the recent accumulation of high-quality genome assemblies of many species has provided unprecedented challenges and opportunities for read clustering. We present a new read clustering algorithm that utilizes genome assemblies of related species, called references. Given paired-end reads of a target species and genome sequences of references, our algorithm (i) constructs syntenic regions among reference genomes, (ii) groups reads that are mapped to the same syntenic region as members of the same cluster, (iii) groups unmapped reads based on their proximity in reference genomes, and (iv) further merges generated clusters using the pair information of paired-end reads. The performance of our algorithm was confirmed by evaluation using simulation-based read sequences of yeast and human, and it was also successfully applied to generate a high-quality assembly of a human genome. We believe that our clustering algorithm will serve as a more valuable tool as more high-quality reference genome assemblies are accumulated.

Splicing-structure-based selection of protein isoforms improves the accuracy of gene tree reconstruction
COSI: EvoCompGen COSI
  • Esaie Kuitche Kamela, Université de Sherbrooke, Canada
  • Wend-Yam D. Davy Ouédraogo, Université de Sherbrooke, Canada
  • Marie Degen, Université de Sherbrooke, Canada
  • Shengrui Wang, Université de Sherbrooke, Canada
  • Aida Ouangraoua, Université de Sherbrooke, Canada

Short Abstract: Constructing accurate gene trees is important, as gene trees play a key role in several biological studies, such as species tree reconstruction, gene functional analysis and gene family evolution studies. Although several methods have provided large improvements in the construction and the correction of gene trees by making use of the relationship with a species tree in addition to multiple sequence alignments, there is still room for improvement in the accuracy of gene trees and the computing time. In particular, accounting for alternative splicing, which allows eukaryote genes to produce multiple transcripts and proteins per gene, is a way to improve the quality of the multiple sequence alignments used to reconstruct gene trees. Current methods for gene tree reconstruction usually make use of a set of transcripts composed of one representative transcript per gene to generate multiple sequence alignments, which are then used to estimate gene trees. In this work, we present two new splicing-structure-based methods that estimate gene trees by selecting, based on splicing structure, an accurate set of homologous transcripts to represent the genes of a gene family. The results show that the new methods compare favorably with the currently most used gene tree construction methods.

TAGOPSIN: A tool for retrieving taxa-specific protein structural and functional information
COSI: EvoCompGen COSI
  • Eshan Bundhoo, University of Mauritius, Mauritius
  • Anisah Ghoorah, University of Mauritius, Mauritius
  • Yasmina Jaufeerally-Fakim, University of Mauritius, Mauritius

Short Abstract: Extensive biological data are currently readily available in public databases, making it possible for a specific research problem to be addressed by multi-database search and data retrieval. However, getting the correct interconnected data from a number of different databases can be cumbersome. Here, we present TAGOPSIN (TAxonomy, Gene, Ontology, Protein, Structure INtegrated), a command line program written in Java whose purpose is to retrieve data from seven public biological repositories and assemble them in a single data warehouse managed by PostgreSQL.
Using the object-oriented paradigm, organisms, genomes, genes, proteins, biological functions, protein domain families and protein 3D structures are modelled as real-world interrelated entities. Accordingly, TAGOPSIN retrieves selected data from their respective FTP or HTTP servers and gathers them in a unified local repository.
The program was tested with several model prokaryotic organisms. For example, about 1.1 million coding nucleotide sequences and 1,706 PDB entries were retrieved for 264 strains of Mycobacterium tuberculosis. TAGOPSIN constitutes a valuable tool in molecular evolutionary and other comparative genomics studies as well as structure-based studies.
TAGOPSIN is released under the GNU General Public License and is available as a JAR file at .

The effect of albendazole on the microbial community of rumen, abomasum and feces of sheep infected with Haemonchus contortus and Trichostrongylus colubriformes
COSI: EvoCompGen COSI
  • Emiliana Manesco Romagnoli, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil
  • Helder Louvandini, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil
  • Patricia Spoto Corrêa, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil
  • Tsai Siu Mui, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil
  • Adibe Luiz Abdalla, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil
  • Diego Mauricio Riaño-Pachón, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil

Short Abstract: In Brazil, sheep are often infected by nematodes, mainly the species Haemonchus contortus and Trichostrongylus colubriformes, causing large economic impact. These infestations are usually treated with the antiparasitic drug albendazole. In this study we evaluated the changes in microbiome composition and function after treating nematode infestations with albendazole. Six animals, all infected with H. contortus and T. colubriformes, were divided into two groups: i) control (no drug); ii) 0.5 ml of albendazole after 21 days of infection. The animals were slaughtered, and samples of rumen, abomasum and feces were collected for DNA extraction. We generated an Illumina Nextera DNA library for each sample and sequenced it on a HiSeq 2500, generating on average 20,000,000 PE reads (2x100 bp) per sample. For taxonomic assessment, we used MetaPhlAn on the quality-trimmed reads. For functional assessment we assembled the metagenomes of each treatment using Megahit, IDBA-UD and metaSPAdes, predicted protein-coding genes with Prodigal and generated a non-redundant protein set with CD-HIT. We focused on CAZymes using dbCAN and on proteins involved in antimicrobial resistance using Resfams. Our results will contribute to a better understanding of the effect of albendazole on the ruminant microbiome.

The functional importance of tandem exon duplications
COSI: EvoCompGen COSI
  • Laura Martinez-Gomez, Centro Nacional de Investigaciones Oncológicas, Spain
  • Michael L. Tress, Centro Nacional de Investigaciones Oncológicas, Spain
  • Fernando Pozo Ocampo, Centro Nacional de Investigaciones Oncológicas, Spain
  • Thomas A. Walsh, Centro Nacional de Investigaciones Oncológicas, Spain
  • Federico Abascal, Wellcome Trust Sanger Institute, United Kingdom

Short Abstract: Alternative splicing and gene duplication have been proposed as two of the major mechanisms providing protein functional diversity. From the point of view of the protein, there are essentially just two types of alternative splicing, indels and substitutions. Protein sequence substitutions can be distinguished by whether or not they arose from tandem exon duplications.

These substitutions make up only a small proportion of the annotated substitutions in the human genome. However, it has been shown that alternative isoforms generated from homologous duplicated exons are significantly over-represented in mass spectrometry studies, have very subtle effects in terms of protein folding disruption and a number are implicated in development and disease.

We manually retrieved 248 pairs of homologous exons. We estimated the duplication dates and found that almost 90% of them were conserved, more than 10 times as many as other alternative exons. We also detected peptides for more than 50% of the isoforms generated from these homologous exons (compared to fewer than 0.1% for all other types of splice events).

Our results suggest that the generation of alternative isoforms from exon duplications, while rare, is likely to be an important means of generating functional diversity in eukaryotes.

The Pseudocercospora ulei genome assembly reveals a significant size expansion mediated by specific transposable elements
COSI: EvoCompGen COSI
  • Sandra Milena González Sayer, Biotechnology Institute, National University of Colombia, Colombia
  • Ursula Oggenfuss, Université de Neuchâtel, Switzerland
  • Ibonne Aydee García, Biotechnology Institute, National University of Colombia, Colombia
  • Fabio Ancizar Aristizabal, Biotechnology Institute, National University of Colombia, Colombia
  • Daniel Croll, Université de Neuchâtel, Switzerland
  • Diego Mauricio Riaño-Pachón, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil

Short Abstract: Natural rubber is a biopolymer commercially produced by Hevea brasiliensis that represents the raw material for the manufacture of products for the medical and automotive industries. South American Leaf Blight (SALB) represents the main threat to Latin American rubber tree plantations. The ascomycete fungus Pseudocercospora ulei is the causal agent of SALB. Despite its agronomic importance, little is known about the biology and pathogenic behaviour of this pathogen. Our main goal was to generate a high-quality P. ulei genome sequence and annotation, with special focus on potential pathogenicity mechanisms. We carried out whole-genome shotgun sequencing using long reads (PacBio and ONT) and short reads (Illumina). The genome was assembled with Canu from the PacBio data, polished with Pilon, and scaffolded with LINKS using the ONT data. The assembly has a size of 93.8 Mb, an N50 of 2.8 Mb, 214 scaffolds, and a BUSCO completeness of 97.5%. The P. ulei genome shows an exceptional content of repetitive sequences (80% of the genome size), which are mostly Class I transposable elements. Genome annotation revealed 9898 protein-coding gene loci. Our results indicate that this genome underwent a size expansion via repetitive elements, and that this expansion can also play an essential role in fungal adaptation and pathogenicity.
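
The N50 value quoted in the abstract above is a standard contiguity statistic: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. The following is a minimal sketch of that calculation in Python. The function name `n50` and the example lengths are hypothetical and chosen only for illustration; they are not taken from the P. ulei assembly or from the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    # Walk from the longest sequence downwards, accumulating length
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running sum to at least
        # half of the total assembly length defines the N50
        if running >= half_total:
            return length


# Hypothetical scaffold lengths (illustrative only, not real data)
example_lengths = [50, 40, 30, 20, 10]
print(n50(example_lengths))
```

In practice the lengths would be read from the scaffold FASTA file, and assembly evaluation tools usually report N50 alongside the scaffold count and total size.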

The unique mRNA decapping enzyme ALPH1 of trypanosomes
COSI: EvoCompGen COSI
  • Bridget P. Bannerman, University of Cambridge, United Kingdom
  • Susanne Kramer, University of Wuerzburg, Germany

Short Abstract: The 5′ ends of eukaryotic mRNAs are modified with a m7G cap as a protection against uncontrolled decay. mRNA decay is typically initiated by the shortening of the poly(A) tail, followed by degradation of the mRNA in either 5′ to 3′ or 3′ to 5′ direction. In the 5′-to-3′ decay pathway, the m7G cap is removed by the nudix domain protein Dcp2 along with a specialized multiprotein factory called the decapping complex.
Trypanosomes lack homologues of all decapping complex proteins, and we have recently identified an ApaH-like phosphatase (TbALPH1) as the major mRNA decapping enzyme of trypanosomes.

TbALPH1 is essential and fulfils all in vitro and in vivo criteria of a decapping enzyme. ApaH-like phosphatases are present in eukaryotes of all kingdoms, but the trypanosome enzyme is the first ApaH-like phosphatase with an assigned function.

An extensive evolutionary analysis of all ApaH-like phosphatases revealed that its function as a decapping enzyme is most likely unique to the Kinetoplastida. Trypanosomes left the eukaryotic mainstream early, and this highly unusual mRNA decapping enzyme is another prominent example of the non-conventional biology of trypanosomes. Given the absence of ApaH-like phosphatases from humans, ALPH1 is a drug target candidate.

TRIFID: determining functional isoforms
COSI: EvoCompGen COSI
  • Fernando Pozo Ocampo, Spanish National Cancer Research Centre (CNIO), Spain
  • Laura Martinez Gomez, Centro Nacional de Investigaciones Oncológicas, Spain
  • Michael Tress, Spanish National Cancer Research Centre, Spain

Short Abstract: Alternative splicing (AS) of messenger RNA can generate a wide variety of mature RNA transcripts, and this expression is confirmed by experimental transcript evidence. In theory these transcripts could generate protein isoforms with diverse cellular functions. However, while peptide evidence strongly supports a main protein isoform for the vast majority of coding genes, it is not clear what proportion of these AS isoforms form stable functional proteins. In fact, reliable proteomics experiments have found little evidence of alternatively spliced proteins, so the number of stably folded, functional proteins produced by AS remains a mystery.

We have developed a computational method (TRIFID) for the classification of splice isoform functional importance. This machine-learning algorithm was trained on reliable peptide evidence from proteomics analyses and classifies biologically important splice isoforms with high confidence. The algorithm ranks the most significant biological splice isoforms and we show that the highest scoring alternative exons are actually under selection pressure, unlike the vast majority of alternative exons. TRIFID can predict functional isoforms for any well-annotated eukaryotic species. The method will generate valuable insights into the cellular importance of alternative splicing.

What is the structure of the ‘evolutionary model space’ for proteins?
COSI: EvoCompGen COSI
  • Edward Braun, University of Florida, United States
  • Akanksha Pandy, University of Florida, United States
  • Gabrielle Scolaro, University of Florida, United States
  • Matthew Chang, University of Florida, United States
  • Emily Gordon, University of Florida, United States

Short Abstract: Estimates of amino acid exchangeabilities are central to models of protein evolution; there have been many efforts to estimate those parameters using large numbers of proteins. Although models trained in this way can be useful for phylogenetic analyses, they provide limited information about the process of protein evolution. Recent studies have revealed that patterns of protein evolution (assessed using amino acid exchangeability parameters) vary across the tree of life; in other words, the processes underlying protein evolution are non-homogeneous. However, optimizing a 20-state non-homogeneous model requires the estimation of many free parameters. Thus, it represents a challenging computational problem. There are two straightforward ways to simplify this problem: 1) estimate parameters for the best-approximating time-reversible model using restricted taxon sets; or 2) reduce the number of free parameters by constraining the model using biochemical information. Different protein structural environments are also associated with distinct amino acid substitution patterns; this results in a mixture of underlying models that also differ among taxa. Efforts to use the approaches described above, in combination with structural partitioning, to understand the space of protein evolutionary models will be described.
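
To give a sense of the parameter counts mentioned in the abstract above, the sketch below counts the free parameters of a general time-reversible model with a given number of states, and then of a non-homogeneous model. It assumes the simplest possible non-homogeneous setup, in which each of several lineages or partitions gets its own independent time-reversible model; that setup, and the function names, are illustrative assumptions and not the authors' formulation.

```python
def reversible_free_parameters(n_states):
    """Free parameters of a general time-reversible substitution model."""
    # Symmetric exchangeabilities: one per unordered pair of states,
    # minus one that is fixed to set the overall rate scale
    exchangeabilities = n_states * (n_states - 1) // 2 - 1
    # Equilibrium frequencies sum to 1, so one of them is determined
    frequencies = n_states - 1
    return exchangeabilities + frequencies


def nonhomogeneous_free_parameters(n_states, n_models):
    """Free parameters if each of n_models lineages or partitions has
    its own independent reversible model (an illustrative assumption)."""
    return n_models * reversible_free_parameters(n_states)


print(reversible_free_parameters(20))
print(nonhomogeneous_free_parameters(20, 10))
```

For 20 amino acid states a single reversible model already has 208 free parameters (189 exchangeabilities and 19 frequencies), so the count grows quickly once the model is allowed to differ among taxa. This is the motivation for the two simplifications listed in the abstract.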

When, why and how tumour clonal diversity predicts survival
COSI: EvoCompGen COSI
  • Robert Noble, Department of Biosystems Science and Engineering, ETH Zurich, Switzerland
  • John Burley, Institute at Brown for Environment and Society, Brown University, United States
  • Cécile Le Sueur, Department of Biosystems Science and Engineering, ETH Zurich, Switzerland
  • Michael Hochberg, Institut des Sciences de l’Evolution, France

Short Abstract: Intratumour heterogeneity holds promise as a prognostic biomarker in multiple cancer types. However, the relationship between this marker and its clinical impact is mediated by an evolutionary process that is not well understood. We employ a spatial computational model of tumour evolution to assess when, why and how intratumour heterogeneity can be used to forecast tumour growth rate and patient survival. We identify three conditions that can lead to a positive correlation between clonal diversity, subsequent tumour growth rate, and clinical progression: diversity is measured early in tumour development; selective sweeps are rare; and/or tumours vary in the rate at which they acquire driver mutations. Opposite conditions typically lead to negative correlation. Our results further suggest that prognosis can be better predicted on the basis of both clonal diversity and genomic instability than either factor alone. Nevertheless, we find that, for predicting tumour growth and progression, clonal diversity alone is likely to underperform conventional measures of tumour stage and grade. We thus offer explanations – grounded in evolutionary theory – for empirical findings in various cancers. Our work informs the search for new prognostic biomarkers and contributes to the development of predictive oncology.